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Abstract 

We consider finite horizon reach-avoid problems for discrete time stochastic systems. Our goal is to construct upper bound 
functions for the reach-avoid probability by means of tractable convex optimization problems. We achieve this by restricting 
attention to the span of Gaussian radial basis functions and imposing structural assumptions on the transition kernel of the 
stochastic processes as well as the target and safe sets of the reach-avoid problem. In particular, we require the kernel to be 
written as a Gaussian mixture density with each mean of the distribution being affine in the current state and input and the 
target and safe sets to be written as intersections of quadratic inequalities. Taking advantage of these structural assumptions, 
we formulate a recursion of semidefinite programs where each step provides an upper bound to the value function of the reach- 
avoid problem. The upper bounds provide a performance metric to which any suboptimal control policy can be compared, and 
can themselves be used to construct suboptimal control policies. We illustrate via numerical examples that even if the resulting 
bounds are conservative, the associated control policies achieve higher reach-avoid probabilities than heuristic controllers for 
problems of large state-input space dimensions (more than 20). The results presented in this paper, far exceed the limits of 
current approximation methods for reach-avoid problems in the specific class of stochastic systems considered. 


Key words: Reachability, approximate dynamic programming, value function bounds, Markov decision processes, stochastic 
control. 


1 Introduction 

We deal with Markov decision processes of specific structure and focus on the stochastic reach-avoid problem where 
one maximizes the probability of reaching a target set while staying in a safe set, under given control constraints. This 
type of optimal control problem can be solved by a dynamic programming recursion [1,2] but its practical applications 
are limited due to the infamous curse of dimensionality that affects state-of-the-art space-gridding techniques. Despite 
their limitations, space-gridding methods are theoretically attractive since they provide straightforward ways to 
estimate approximation errors with respect to the optimal solution function, under general Lipschitz continuity 
assumptions on the associated system dynamics [3,4,5]. In an attempt to extend the possible applications of the 
reach-avoid problem, we follow a less standard approach [6] than dynamic programming to characterize the optimal 
value function of the stochastic reach-avoid problem. We then construct an upper bound for the optimal value 
function and evaluate in practice the performance of the associated approximate control policy . 

The main tools used in this work stem from convex optimization and methods used to reformulate robust inequalities 
[7,8,9] into semidefinite constraints. The authors in [7] used a similar framework and showed its success in terms of 
control performance in a different type of optimal control problem. The framework considered here is fundamentally 
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different due to the different type of basis approximation functions used and the way that reach-avoid problems handle 
state and control constraints naturally via the definition of their cost function. Another approach that addresses 
the approximation of reach-avoid problems based on the method suggested in [6] is presented in [10,11]- Compared 
to the results in [11], the technique developed here provides deterministic upper bounds to the reach-avoid value 
function (as opposed to probabilistic ones), scales to significantly higher dimensional problems, but requires stronger 
restrictions on the kernel of the stochastic processes. 

Our results are based on restricting the decision space of an infinite dimensional linear program [6,11] to the space 
spanned by a finite collections of Gaussian radial basis functions (RBFs) and then searching for means, variances 
and weights of a sum of such functions to upper bound the reach-avoid probability. We show that for some process 
kernels, such as the kernel of a linear system with additive Gaussian mixture noise, this restriction allows us to 
analytically compute the expectation of the stochastic process over the basis functions and reduce the problem of 
upper bounding the optimal reach-avoid value function to determining the parameters of a Gaussian RBF sum that 
upper bounds another such sum. We derive sufficient conditions in terms of linear matrix inequalities (LMIs) to 
characterize dominance between Gaussian RBF sums and then use them to formulate a semidefinite program (SDP) 
with which one can construct an upper bound to the optimal value function at each step of the reach-avoid recursion. 
We show via numerical examples that approximate control policies defined using these bounds scale to problems of 
significantly higher dimensions than what is currently possible with state gridding techniques. 

The rest of the paper is organized as follows: In Section 2 we recall the basic definition of the reach-avoid problem 
for Markov decision processes and formulate a robust optimization problem whose solution is an upper bound to 
the reach-avoid value function. We focus most of Section 3 to reformulate the constraints of the robust optimization 
problem into sufficient conditions that can be checked efficiently with existing convex optimization software. Using 
these conditions we provide an algorithm to construct upper bounds for the value function of every step of a reach- 
avoid recursion. We then use the constructed bounds to define approximate control policies that maximize the 
reach-avoid probability. In Section 4 we provide numerical examples to estimate the suboptimality gap between the 
constructed bounds and the optimal value function (computed via gridding) and illustrate that the approximate 
control policies can outperform heuristic controllers even when the bounds are not tight. 

2 Stochastic Reach-Avoid Problem 

2.1 Dynamic programming approach 

Let X = R” denote a continuous state space and U C R m a continuous and compact control space. We consider a 
discrete-time controlled stochastic process x t +\ ~ Q(dx\x t ,u t ), (x t ,u t ) € X x U with a transition kernel Q : B(X) x 
X xU -£ [ 0,1] where B(X) denotes the Borel u-algebra of X. Given a state control pair (x t , u t ) € X x U , Q(A\x t , u t ) 
measures the probability of x t +\ being in a set A £ B(X). The transition kernel Q is a Borel-measurable stochastic 
kernel, that is, Q(A\-) is a Borel-measurable function on X x U for each A £ B(X) and Q(-\x,u) is a probability 
measure on X for each (x,u). For the rest of the paper all measurability conditions refer to Borel measurability. We 
consider a safe set K' £ B(X) and a target set K C K'. We define an admissible T-step control policy to be a sequence 
of measurable functions ir = {7To,..., 7 Tt-i} where 7Tj : X —> IA for each i £ {0,..., T — 1}. The reach-avoid problem 
over a finite time horizon T is to find an admissible T-step control policy that maximizes the probability of Xt reaching 
the set K at some time tj< < T while staying in K' for all t <tx- For any initial state Xq we denote the reach-avoid 
probability associated with a given n as: r^ 0 (K,K') = P£ o {3j £ [0, T] : Xj £ K A Vi £ [0 ,j — 1], Xi £ K' \ K} and 
operate under the convention that [0,-1] = 0, which implies that the requirement on i is automatically satisfied 
when Xo £ K. 

In [2], r^ 0 (K,K') is shown to be equivalent to the following sum multiplicative cost function: 


where nU(') = lif k > j. The function 1a(%) denotes the indicator function of a set A £ B{X) with 1 a{x) = 1 if 
x £ A and l^)#) = 0 otherwise. The sets I\ and K' can be time dependent or even stochastic [12] but for simplicity 


i -1 


n 1 K'\k{Xi) 1 x(Xj) 


i =0 
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we assume here that they are constant. We denote the difference between the safe and target sets by X := K' \K 
to simplify the presentation of our results. 

The solution to the reach-avoid problem is given by a dynamic recursion [2]. Define : A —>■ [0, 1] for k = T— 1,..., 0 
by : 

V£(x) = swpit K (x) + tx(x) [ Vk +1 (y)Q{dy\x,u)\ 

ueu l Jx J F) 

Vt{x) = t K (x). 

The value of the above recursion at k = 0 and for any initial state Xq is the supremum of (1) over all admissible 
policies, i.e. Vq (T’o) = sup w rj 0 ( K , K') [2] . In [11] the authors proved measurability of the value functions (and hence 
attainability of the supremum) in (2) under the assumption that Q is continuous with respect to u G U for any given 
x € X . For the rest of this work we operate under the same kernel continuity assumption: 

Assumption 1 (Kernel continuity) For every x € X,A € B(X) the mapping u K»• Q(A\x,u) is continuous. 

We define the optimal reach-avoid feedback policy at each state k by 


TT* k (x) = argrnax ^t K (x) + tx{x) V£ +1 {y)Q(dy\x, u) j . (3) 

The only established way to approximate the solution of (2) is by gridding X x U and computing it backwards, 
starting from the known value function V^,{x) = \k(x). In this way, the value of Vq (x) is approximated on the grid 
points of X while an approximate control policy at each step k is computed by taking the maximum of (3) over 
the grid points of IA. The advantages of this approach are that it is straightforward to estimate the approximation 
accuracy as a function of the grid size (under suitable regularity assumptions [3,4]) and that the approximate feedback 
control policy can be stored as a look-up table rather than computed online at each state. The disadvantage is that it 
quickly becomes intractable as the dimensions of the state and control spaces increase. In what follows we formulate 
an infinite dimensional linear program equivalent to the reach-avoid recursion and use tools from robust optimization 
to construct upper bound functions. 

2.2 Robust optimization problem 

Inspired by the linear programming approach to dynamic programming [6], the authors in [11] formulated very 
similar version to the following infinite dimensional linear program over the space of measurable functions F = {/ : 
X —> R, / measurable} to characterize the reach-avoid value function at time k: 


min / V(x)i , (dx) 

v(-)er Jx 

subject to V(x) > 1 k{x) + tx{x) / V£ +1 {z)Q(dz \x,u), V(x, u) € X x U 

.he 


(4) 


where we use min instead of inf since, as shown in [11], Assumption 1 ensures attainability of the optimizer. The 
constraints in (4) are equivalent to the pair of constraints 


V(x) > 1, Vx e K 

V{x)> f V£ +1 (z)Q(dz\x,u), V(x,u) G X x U. 

Jx 

The measure v is interpreted as a state relevance measure [6] that can be chosen to bias the value of the cost 
function in different places of the state space. In order to search for a feasible point in (4), we restrict the infinite 
dimensional function space J 7 to a finite dimensional subspace T. Assume that V£ +l is known and consider the 
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following semi-infinite linear program defined over functions V G T C T 


min / V(x)^(dx) 
veF Jx 

subject to V(x)> I V£ +1 {z)Q{dz\x,u), \/(x,u) £ X x U ^ 

Jx 

V(x) > 1, Vx G K. 

To guarantee that the minimum is attained in (5), additional assumptions may be required on T. For instance, T 
would have to be bounded to exclude situations where the optimization program has an unbounded cost function, e.g. 
—oo on the complement of X which is unconstrained. An alternative approach is to explicitly add constraints on the 
value of V on the complement of X, such as V(x) > 0,Vx G X\X since V represents a reach-avoid probability. The 
rest of the section is devoted to reformulating the constraints of (5) into conditions that can be checked efficiently 
using convex optimization software. 


3 Robust constraint reformulation 


Checking the constraints in (5) numerically is, in general, intractable due to the infinite cardinality of the constraint 
set. Depending on the structure of K, X x U, the function f x V£ +1 (z)Q (dz | x,u) and the representation of V in J -, 
it may be possible to reformulate the constraints. In this section we impose specific structure to these elements and 
derive sufficient conditions to construct feasible solutions for (5). 


3.1 Gaussian radial basis functions 


Let the restricted decision space be F — F M where T M denotes the span of a set {(j>(x, Hi, £,)}of M G N + 
Gaussian radial basis functions (RBFs) each one defined as, 




1 

— / exp 

v^mi 



/r i ) T S i *(x 



( 6 ) 


where the operator | • | denotes the determinant of the argument matrix. Note that each function <p(x, /j,, E. ( ) 
corresponds to the density function of a multivariate Gaussian random variable A/"(/q, E$) with x, fj,i G R", E* G S” 
where S" denotes the space of (symmetric) positive definite matrices. We use the standard notation X 0 to denote 
that a matrix X G R" xn is in S!f and hence we have that E, 0. Every element of if G T M can be written as a 
Gaussian RBF sum defined as: 


Definition 1 (Gaussian RBF sum) A Gaussian RBF sum if) : X — > 3? is a linear combination of functions in the 
form of (6), i.e. ip{ x ) = YloLi Wi<f>(x, /iq, Ej) for some MgN and w G R M with Ej )p 0 and Hi G 3?” for i = 1,..., M 
where n denotes the dimension of X. 

In what follows, we derive conditions by which one can choose {w l) //,, E i}f^i to upper bound the reach-avoid value 
function. We restrict ourselves to quadratic sets, defined as: 

Definition 2 (Quadratic set in R”) A quadratic set is any set A C R™ that can be written as 


( 

r , 

T 

r 


X 


X 

l x G R” 

1 

Aj 

1 


>0, VjG{l,...,iV} 


for some N G N + and symmetric matrices Aj G R” +1 . 


Sets satisfying the condition in Definition 2 include intersections of ellipsoids and half spaces that can be used to 
approximate convex sets arbitrarily closely. However, quadratic sets are not a subset of convex sets since one can, 
for example, express the difference between ellipsoids and half spaces as a quadratic set which implies that a wide 
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range of non-convex sets satisfy Definition 2. Given two function elements in T M , the following lemma establishes 
sufficient conditions on their weights, means and variances such that one uniformly dominates the other on a given 
quadratic set: 


Lemma 3 (RBF dominance) Let {wi, Pi, and {wi, pi,Hi}f£. 1 denote the weights, means and variances 

of two Gaussian RBF sum functions 'll), ip € J- M and assume the weights {wi}f£ 1 are non-negative. Consider a 
quadratic set A C M" described by N inequalities. We have that ip{x) > ip(x),\/x £ A if for all i = 1 
diy > r n 0, such that JQ )>= 0 


Ql) Q iM 

Qji Qi + Qfii&i - i bjAj 


where 


Qi 


-1 

o 

o 

_1 


O 

O 


O 

i—I 


E" 1 -E j 1 pi " 


II 

I ~Ti_ 

> Qti — 

0 Eri_ 


-^E" 1 pJZf'pp 


and 0,1 denote zero and identity matrices of appropriate dimensions. The tei'ms Ti > r jv 0 denote element-wise 
non-negativity where each Ti is an element of space . 


PROOF. Notice that a sufficient condition to ensure that ip(x) > ip(x), \/x £ A is that Wi<j>(x, /q, E$) > Wi<p(x, Pi, E;), \/x £ 


A, i = 1 Consider x £ R" and let z = 


£ R” +1 . Since all Wi are assumed non-negative, the 


weights u>i also have to be non-negative and we can take the natural logarithm on the above relation to get 
log [u>i(p{x, pi, Ej) ) - log (■ Wi<p(x, pi, E {)) >0, Vx £ A if and only if 


z T AjZ >0, j = 1,..., N => z T - Q jlu t i + Qi ) z>Q 


(7) 


where Q^^s,, Q£ and Qi are defined as 


Qm,T.i = 


s,- 1 

-TjZf 1 pjT,~ l p : 


f Qp,i,£i — 


s- 1 

'Tv-1 


~^i 1 Ri 
Tv-l; 


Ti S-V 


Qi = 


o 


0 2 log 


wy/ti. 


Applying the S-procedure to the condition in (7), we have that if 


N 


3Tij > 0 such that - Q fl ^£ i + Qi ~ ^ n,jAj )p 0 

j—i 


then 


z 1 AjZ >0,j = l,...,N => z T - Qp u ± i +Qi) z > 0. 
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Consider now the transformation Q, y . = Qj^.Qy.Qa- with Q; l% = 


0 

0 


1 0' 


1 


_I ~IM_ 


Lo s- 1 ] 


. Notice that Q Mi) £ 4 — 


QliQt Qpa + Qi — S/Li UjAj is the Schur complement of the matrix 


Q t ] Qiu 

Qji Qm,Zi +Qi~ SyLi T i,jAj 


and since Q f , 1 is positive dehnite by definition (Ej is the covariance matrix of a Gaussian RBF), the condition 
Qm,Si ~ QjiQi.Qiii + Qi ~ J2j= 1 T i,jAj 0 is equivalent to X. t 0. 


Notice that there may be cases where multipliers 77 do not exist unless additional boundedness assumptions are 
imposed on the set A. In what follows we restrict the structure of the reach-avoid problem and illustrate how one 
can express the riglrt-hand-side of the constraints in (5) as a sum of Gaussian RBFs. We then use the result of 
Lemma 3 to construct upper bound approximations for the reach-avoid value functions. 


3.2 Upper bound for the reach-avoid value functions 


We impose the following assumptions in order to express the optimization problem in (5) into the framework of 
Gaussian RBFs. The process involves two steps: we first derive a general result concerning the expectation of a 
Gaussian RBF sum applied to a random variable whose density function is a Gaussian RBF sum and then apply it 
to the constraints in (5). We make use of the following: 


Lemma 4 Given a bounded set A C 3?", there always exists a Gaussian RBF sum pa such that pa(x ) > 1 , 4 ( 2 ;) for 
all x £ 3?”. 


The proof of Lemma 4 is omitted since it is trivial to show that even a single Gaussian RBF in the form of (6) is 
enough to upper bound the indicator function of a bounded set by simply taking a large enough scaling weight. 


Assumption 2 The stochastic kernel Qf\x,u) is defined through a Gaussian RBF sum density function of the form 
w j‘t > (y > Tj{ x i u )i Sj) for J £ N+ and functions pj affine in ( x , u ). 


The condition imposed on the transition kernels by Assumption 2 holds for a wide range of stochastic dynamical 
systems. For example, linear systems affected by general additive Gaussian mixture noise satisfy Assumption 2 and 
can be used as a basis to approximate the kernel of more general non-linear systems. 


Lemma 5 (RBF expectation) Let y € 3?" be a random variable distributed according to a Gaussian RBF sum 
density function of the form 'Yhj = \Wj(j>{x,iij,Y , j) for some J £ N+ and x £ 3? n . Consider a Gaussian RBF sum 
function g £ T M defined by g(x ) = w i ( l > ( x 7 t l ii Sj) for any x £ 3?™. The expected value of the random variable 

g(y) is a Gaussian RBF sum given by Y^fLi /Uj=i u>iWj(j>(p-j, Hi, Sb + Ej) 
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PROOF. Consider the function g(x) = XE w i<j>{x, Adi E$) and a random variable y G 3?" with density function 
given by Xj=i Wj</)(x, pj,flj), x G 3?” . We have that 

- M J 

K [g(y)\ = / E Wi(t>(x, LM, E*) Y] Wj<j)(x, dx 

•'»"i=l 3=1 

M J 

EE WiWj(f)(x, m, T,i)<j>(x, Hj, Sj) dx 
*=l j=i 

M J 

= EE^ 

*=1 i=l 
M J 

= E E WiWjtKfij ,!*,+ X) 

*=1 i=i 

The last equality follows from the fact that the product of two functions in the form of (6) is proportional to another 
function of the same form [13, Section 2]. With 


4>(x, in , Si)^(x, dx 



(j){x,= o(//,.// y . S, + Ej)(^(a:,/i,E), 

p = (E” 1 + EJ 1 ) 1 (Sr 1 ^ + E” 1 /^), and E = (EG 1 + E” 1 ) \ The result follows as the proportionality constant 
(f>(H », pj, Si + Sj) is independent of a: and the function <j>(x, p, S) integrates to 1 over R n . 


Proposition 6 (Upper bound) Consider Lemma 4 and Assumption 2. LetVf(x) = pk(x) wherepx upper bounds 
1 k- The following recursion of optimization problems provides an upper bound on the reach-avoid value function for 
each k = T — 1,..., 0, for any choice of non-negative measure v. 


Vf = arg min 
Me A 

subject to 


U fc (a;)^(dx) 


ix 


V k (x)> / V/f +1 (z)Q(dz \x, u), 


ix 


V k (x) >1 \/x £ K. 


V(x, u) G X xU 


( 8 ) 


PROOF. We use induction. The reach-avoid value function at time T — 1 is defined by 

Vff_ 1 {x) = max { \k{x) + l^(x) [ Vf (z)Q(dz \x,u) \ fix & X . 

ueu [ J x J 

By definition we have that Vf(x) = Pk(x ) > 1 k(x) ='■ Vf(x) f° r x G X. We also have that for all (x, u) € X xU, 
f x Vf(z)Q(dz\x,u) > J x Vf(z)Q(dz\x,u). Hence, for any feasible point Vt-i of (8), we have that Vt~i(x) > 
t x( x )Ix Vf(z)Q{dz\x,u) > t x (x)f x Vf(z)Q(dz\x, u ) for all (x,u) G X xW. Moreover, the constraint Ur_i(a;) > 
1, \/x G K ensures that all feasible functions satisfy Vt- i(x) > lx(x) for all x G X. We conclude that for any 
feasible function in (8) it holds that Vt- i(x) > t K {x) + t x (x) f x Vf(z)Q(dz\ x,u ), \/(x,u) G X x U and it thus 
upper bounds Vf_ 1 (x). Using induction the same result follows for all k = T — 2,..., 0. 

The final step of the process is to derive a tractable convex optimization program to construct a feasible solution 
to the problem in (8). We can then use it recursively to construct upper bounds to the value function for all 
k = 0,..., T — 1. First notice that from Lemma 4 and Assumption and 2 both Vf and Q(-\x, u) are Gaussian RBF 
sums and the right-hand-side of the first constraint in (8) is an RBF expectation that according to Lemma 5 can 
be written as another Gaussian RBF sum. As a result, by choosing the number of basis to construct T according to 


7 



the number of elements in the RBF expectation in the constraints of (8), we can use the RBF dominance result in 
Lennna 3 to reformulate the constraints into sufficient semidefmite constraints. Notice however that depending on 
the number of RBF elements used to describe Q and the number of RBF elements used to approximate the value 
function at step k = T, the number of element required to build T at every step of the recursion may increase, 
affecting the computational complexity of the process. The following proposition provides a tractable semideffihte 
program (SDP) to construct an upper bound to the reach-avoid value function. 


Proposition 7 Consider Lemma 4 and Assumption 2 and further assume that the sets K C K' and S := X xU = 
(. K'\K ) xU are quadratic sets in R ra and R" +m respectively. Let V£ +l (x) = Y^iLi Wif>{x, /q, £*) he a known Gaussian 
RBF sum for some M G N+ for which it holds that Vk+i{x) > V^ +1 {x) for all x € X. Consider a density function 
for Q given by a single Gaussian RBF defined as q(y,x,u) = 4>(y,po(x,u),Y, 0 ) with the function po : X x U —> X 
defined as po(x,u) = Ax + Bu where A,B are known matrices of appropriate dimensions. Let v be a probability 
measure over X and consider the following semidefinite program: 


where 


with 


nnn 


subject to 


M 

J2 y i + 5 tr (^') 

i—1 

Xi^O 
Xi^O 

T iiPi >R 

fli G ffi” 
ti G Si 


0 } i = 


(9) 


X,; = 


-l 

% 

il 

Mi 


Qn 


Qi + ^Mi.Si — Sj=i n,jS. 


xi = 


Q t ] Qfn 
Q„. Qi-T,f=iPijK 



0 0 

0 


0 0 0 


1 

o 

O 

i—l 

Qi = 

0 0 

0 

i Q fli — 

0 0 0 

) Qti — 

0 10 


0 0 2 (^ji 



<i 

1 

o 

hH 


0 0 E- 1 


' A T Z t A 

A T %B -v4 t E, 

Fi\ 

0 0 

0 



Q 




dT i 


b t h 

-pj%A -p : 


-B T Eip. 

-pJ%B pJtiiPi 


Qi = 


0 0 
0 0 2 (y, 


0 




and yi := log 




, Ei = (Ei + Eq) . Let {ft*, E*, y*, r*, p*}iL\ denote the optimal solution of (9) and set 


h* = pV, 


i y^E* |. The function constructed by V*{x) = J2iLi w*<j> (x, /t*, E*^ is a feasible solution for (8). 



PROOF. First we show how the constraints of problem (9) are sufficient for the constraints of (8). Given the format 
of functions V£ +1 and q we have from Lemma 5 that 


lx 


Vf(z)Q(dz\x, u) = / ^ Wj<j)(z, Hi, Ej)q(z, x , u) dz 

Jx i= i 
r M 

= / yy Wj<t>(z, Hi, T,j)(j){z, hq(x, u), Sp) dz 

Jx <=i 
M 

= yy Wj<j) {no(x, u), in, Sj + e 0 ) . 


Since U(x) = fii, Ej), the constraint F(x) > V^(2)<5(dz |x, it) in (8) is equivalent to the RBF 

dominance constraint 

M M 

y W>i0(x, fii, Ej) > (/x 0 (x, u), /x*, Ej + E 0 ), V(x, u) £ S 

i=1 i=l 


which according to Lemma 3 is implied by the constraints X* 0 and r* > 0 for i = 1,..., M. The form of the 
matrices Qi, Qm,^, Qpn Q±. is equivalent to the one derived in the proof of Lemma 3, considering the variable 


, the function po (x,u) = Ax + Bu and the transformation y* = log 


vW 


. We replace the second 


constraint V(x) > 1, Vx £ I\ by the sufficient condition fii, Ej) > Y2iLi jj: Vx £ which is again 

an RBF dominance constraint that according to Lemma 3 is implied by the constraints Xi )>= 0 and r,; > 0 for 
i = 1,..., M. Finally, the cost function in (8) reads 



M 


M 


M 


/ yy Wi<t>{x, fii, Ej) dx = yy wi = yy < 

' X i—i i—i i—i 



which we replace with the sum of the logarithms, i.e. 'Yf!f = \ 1°§ 



E ,=1 Vi + \ log(|Ej|) motivated by the 


fact that the logarithm is a monotone function. Still, the elements ,) log(|Ej|) are concave in E, ; and the minimization 
is a non-convex problem. Using the fact that \ log(|Ej|) = \ log(n2=i ^d) = \ Ed=i l°g(^d) and tr(Ej) = Ed=i ^\ 
we employ | tr(Ej) as a heuristic to approximate the terms \ log(|Ej|). 


Feasibility of the problem in Proposition 7 is not easy to prove without additional assumptions. One way to guarantee 
a feasible solution is to add the constant function c(x) = l,Vx £ X in the basis set used to construct J. Notice that 
we have expressed the kernel Q as a single Gaussian RBF to simplify the notation in (9); extensions to Gaussian 
RBF sums with varying functions no(x,u) are straightforward. 

The SDP-based procedure presented in Proposition 7 can be used recursively, starting at k = T— 1 and Vff (x) = pk{x) 
for all x £ X, to construct upper bound approximations to the value functions for every k = 0,..., T — 1. Note that 
the process introduces conservatism due to the RBF dominance constraints used in (9) which are only sufficient 
conditions for the original robust constraints in (8). Moreover, the cost function in (8) is non-convex jointly in 
the weights, means and variances and the approximation used in Proposition 7 constitutes a design choice that 
can affect how close the bounding functions are to the optimal ones on different areas of the state space. Finally, 
the state-relevance measure is assumed to be a probability measure on X but other choices allowing calculation of 
the integral j x U(x)^(dx) as a function of the RBF parameters can be used (for example measures supported on 
hyper-rectangular sets - see [11]). Algorithm 1 summarizes the process of using Proposition 7 recursively to compute 
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Algorithm 1 Approximate value function 

1: Input Data: 

• Quadratic set constraints for X, K and U. 

• Reach-avoid time horizon T. 

• Center and variance of the MDP kernel Q. 

• Gaussian RBF sum pk for which it holds that Pk(x) > 1 k(x) for all x £ X. 

2: Design parameters: 

• State-relevance measure v. 

• Approximation of the cost function f x V(x)u( dx). 

3: Initialize Vp{x) «— Pk(x). 

4: for fc = T — 1 to /c = 0 do 

5: Extract the weights, means and variances {wi, from the known function V£ +1 . 

6: Solve the optimization program in (9) to obtain {/t*, E*, y*}f£i- 

7: Compute u>* = e v »* for i = 1,..., M. 

8: Set y fc * {x) = </>(u At, s*). 

9: end for 

bounds down to time k = 0. Each SDP in the process has a total of 3 M semidefinite constraints where the ones 
corresponding to A,; are of dimension n + m + 1, the ones corresponding to X, are of dimension n + 1 and the ones 
corresponding to Eare of dimension n. If more than one Gaussian RBF elements are used to describe the process 
kernel Q , the total number of semidefinite constraints grows combinatorially in the elements of Q and M and linearly 
in the horizon length (see Lemma 5). 

3.3 Approximate control policy 

Given the value function V fc * at each k = {0 ,... ,T — 1}, the problem in (3) implicitly defines the optimal reach- 
avoid control policy at time k. Using the sequence of approximate value functions Vq ,..., V^_ 1 constructed using 
Proposition 7 recursively, one can compute approximate control policies itk. at each step k = {0,... ,T — 1}. If the 
kernel Q has a Gaussian RBF sum density function q(y, x, u ) = (f>(y , Ax+Bu, E), as in Proposition 7, given V£ +l {x) = 
12iLi Wi4>(x , Hi, Ej), we define the SDP-based approximate control policy at each time step k = {0,..., T — 1} as: 


n k (x) = argmax t K {x) + ly(^) / ^4*+i( 2: ) ( 3(dz \x, u) 

u&A ( J x 

= arg max / V^ +1 {z)Q{d7,\x,u) 

u&A J x 
M 

= arg max Wi<j)(Ax + Bu, im, + E) 


where we have used Lemma 5 to compute the expectation in (*). Despite the fact that the optimization problem in 
(10) is non-convex, standard gradient based algorithms can be employed to obtain a local solution. In particular, the 
cost function is by construction smooth with respect to u for a fixed x £ X and the gradient and Hessian functions 
can be analytically computed. Moreover, the decision space U is typically low dimensional (in most mechanical 
systems for example dirnW < dim A) and mature software is available [14] to compute locally optimal solutions. 
The process of calculating a control input at time k for a fixed state Xk is summarized in Algorithm 2. Alternative 
approaches include using randomized techniques similar to the approaches in [15,16,17,18], general purpose non¬ 
linear programming solvers (e.g. interior point method) or directly gridding the control space, if computationally 
feasible, as done in the second example of the next section. 


4 Numerical examples 

In this section we evaluate the SDP-based upper bounds calculated using the process of Algorithm 1. We first 
investigate the suboptimality of the bound and the performance of the associated control policies by comparing 
them to value functions and policies computed via space gridding. Since space gridding is limited to low dimensional 
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Algorithm 2 Approximate controller 

1 : for k £ {0, ..., T — 1} do 
2: Measure the system state Xk- 

3: Compute the gradient and hessian of YloLi w i ( t ) (Axk + Bu , /q, E 4 + E) with respect to u. 

4: Solve the optimization problem in (10) to obtain the control input nk(xk)- 

5: Apply the calculated control input to the system. 

6 : end for 



Fig. 1. Reach-avoid safe and target sets with example Fig. 2. Value function bound error at time k = 0 

trajectories 

state-input spaces (dim(A x U) around 4-5), we also evaluate the bounding value functions on a simple benchmark 
problem where we assume an LQG controller is a suitable heuristic to maximimize the reach-avoid probability. All 
simulations were carried out on an Intel Core i7 Q820 CPU clocked @ 1.73 GHz with 16GB of RAM memory, using 
MOSEK’s SDP solver in its default settings. 

4-1 Reach-avoid probability bounds compared to space gridding 

Consider the reach-avoid problem of maximizing the probability that the state of a controlled linear system subject 
to additive Gaussian noise reaches a target set around the origin within T = 5 discrete time steps while staying in 
a safe set for all previous times. We consider systems described by the equation 

x k +i = Ax k + Bu k + w fc (11) 

where for each k = {0, ...,4}, Xk £ X = R n , Uk £ U = {u £ < p 2 } with p u = 0.1 and each 

Wfc is distributed as a Gaussian random variable tOk ~ 7V"(0 nx i, E w ) with E^ £ S". We consider a target set 
K = {x £ K"|a; T Q t x < p 2 } with p t = 0.1 around the origin and a safe set K = {x £ M"|:r T <2 s x < p 2 } with 
p s = 1 (see Figure 1). In order to easily change the dimensions of the problem (variables n,m) and keep the system 
behaviour between problems similar, we fix A = l nxra , B = 1 nxm an d = 0.001 ■ l nxn - 

For problem dimensions up to n = 3 and m = 2, we compute the gridding-based value functions by discretizing the 
spaces [— p s , p s ] n and [—/?«, p u ] m and solving the recursion in (2) on the resulting grid, initialized with Vf ( x ) = 1 k { x ). 
Table 1 reports the space discretization for each problem dimension ( N s corresponds to number of elements in each 
dimension of [— p s ,p s ] and N u corresponds to number of elements in each dimension of [— PmPu\) along with the 
time it took to compute a single iteration (one horizon step) of the recursion. The upper bound value functions 
are constructed using the steps of Algorithm 1. The transition kernel of (11) has a Gaussian RBF sum density 
Xk+\ ~ A f{Axk + Buk^ui) with center Axk + Buk and covariance matrix E w while the sets K , K' and IA are by 
definition quadratic sets, satisfying the input requirements of the algorithm. As in Proposition 7, we choose a uniform 
state-relevance measure v supported on X. The only remaining element to execute the algorithm is to construct 
an upper bound function px for the indicator function 1^-. We achieve this by randomly placing Gaussian RBF 
elements on K with diagonal covariance matrices £*, = 0.0005, choosing their weights by means of a linear program, 
making sure that they upper bound 1 k on every point of the grid placed on X (see [11] for details). Table 1 reports 
the number of elements used to upper bound tx for the different cases of dim(A) = 1, 2,3. 
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n 

m 

N s 

N u 

M 

Grid (sec.) 

SDP bound (sec.) 

To* - To* 

* 

TV — TV 

i 

i 

80 

25 

10 

0.4262 

0.62 

0.34 

0.035 

2 

i 

80 

25 

50 

115 

1.88 

0.174 

-0.0186 

2 

2 

80 

25 

50 

1237 

1.83 

0.472 

-0.087 

3 

1 

30 

15 

90 

477 

4.8 

0.053 

0.0083 

3 

2 

30 

15 

90 

2926 

4.15 

0.186 

0.031 


Table 1 

Comparison of grid-based and upper bound value functions 

To estimate the performance of the approximate control policy, we sample 100 initial states Xq uniformly from X 
and for each one generate 100 different noise trajectory realizations, using the grid-based and SDP-based policies 
computed by (3) and Algorithm 2 respectively. We then count the number of trajectories that successfully complete 
the reach-avoid objective, i.e. reach K without leaving K ', making sure that u £ U. In Table 1 we denote by 7r — 7r* 
the mean difference between the empirical success probability of the SDP-based controller (If) and the empirical 
success probability of the controller based on the value function computed via space gridding (7r*). The value of 
y* _ y* corresponds to the mean difference between the values of the upper bound (Pq*) saturated to 1, and the 
gridding-based value function (fg*) at time k = 0, over the grid elements within the set X. Analyzing the results, we 
notice that the approximate control policy based on the value function upper bound yields near optimal performance 
even though the average absolute error is non-zero. Overall, the SDP-based controller outperforms the grid-based 
controller, probably due to coarse discretization of X and U. In terms of computation time, the SDP-based approach 
significantly outperforms space gridding. The time values reported in Table 1 correspond to the computation time 
required to solve one step of the recursions; it is clear that the curse of dimensionality sets a fundamental limit to 
potential applications of space gridding. 

4-2 High-dimensional problems 

In this section we demonstrate the scalability of the SDP-based method by comparing the performance of the control 
policy computed using the value function bounds constructed with Algorithm 1 to an LQG controller. Instead of 
using Algorithm 2 to generate the control policy, we discretize the control space U with 20 points in each dimension, 
resulting in a 20 m grid over which we compute the approximate control policy. We refrain from using non-linear 
programming solvers in this comparison as computation times grow significantly with the dimension of U and are 
impractical for Monte Carlo-type verification simulations. 

We consider the same regulation problem as in Section 4.1 but randomly choose the matrices A £ K" xn and 
B £ R" xm in every experiment such that the resulting systems are open loop stable and controllable under zero 
noise. If the system is unstable, the reach-avoid probability over a short horizon can be close to zero, especially when 
dim(A) >> dirn(W). For the series of experiments carried out in this part, we upper bound \k in a different way 
than in Section 4.1. We use an optimization problem in the form of (9), without the constraints A, A 0, r,; > 0, thus 
getting an upper bound for 1^-. This bound will not be as tight as the one computed with the gridding method of 
the previous example and as a result, all subsequent value function bounds will be less accurate. Our purpose here, 
however, is to compare controller performance and hence any bound for 1 k will work. 

Despite having a different control objective, we assume that using an LQG controller where we choose the state and 
control penalty matrices according to the safe, target and control constraint sets is an efficient heuristic to maximize 
the reach-avoid probability. In particular we use the optimal control policy of the following problem: 

min E 

I«Gr=o 

subject to (11 ),#0 £ X 

where Q and R have been chosen to correspond to the quadratic sets K and Id respectively and penalize directions 
where they are small in size. Whenever the resulting LQG control input (calculated via the Riccati difference 
equation) is infeasible, we project it on the feasible set U. Starting from the same initial sates for both LQG and 
SDP-based controllers, we compare their performance by generating 100 trajectories using the same noise samples, 
counting the number that reach K without leaving K'. To make sure that the initial states chosen have a non-zero 


'T -1 

X k Q X k 


- uj Ruk 


A Xrp Qxt 


( 12 ) 
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n 

m 

it — tvlqg 

LQG control (sec.) 

SDP grid control (sec.) 

SDP setup (sec.) 

3 

3 

0.142 

1.10e-04 

0.0015 

1.21 

3 

4 

0.126 

9.951e-05 

0.0302 

1.11 

4 

3 

0.128 

1.029e-04 

0.0017 

1.1 

4 

4 

0.254 

1.190e-04 

0.0341 

1.09 

6 

3 

0.042 

1.245e-04 

0.0021 

1.08 

5 

4 

0.163 

1.215e-04 

0.0430 

1.07 

6 

4 

0.072 

1.174e-04 

0.0504 

1.094 

5 

5 

0.176 

1.091e-04 

0.846 

1.07 

6 

5 

0.144 

1.256e-04 

0.983 

1.264 

7 

5 

0.133 

1.43e-04 

1.14 

1.33 

8 

5 

0.141 

1.33e-04 

1.224 

1.39 

15 

5 

0.341 

1.55e-04 

2.3 

1.27 


Table 2 

Comparison of LQG and SDP-based controllers based on recursion (2) and Algorithm 1. 

reach-avoid probability we first run the LQG controller and reject states with too low success probabilities. Note 
that since the design criteria of the two methods are different, our comparison is qualitative despite the fact that 
they both drive the system to the origin. Table 2 reports the mean difference between the LQG and SDP-based 
controllers (denoted by 7r — itlqg) over the initial states for different problem dimensions. We notice that the SDP- 
based controller consistently outperforms the LQG controller in terms of success probability but scales worse in 
terms of computation time with the dimensions of the problem. The times reported for the LQG and SDP control, 
correspond to the computation of a single control input given the system state x € X. The last column of Table 
2 reports the time required to construct an upper bound value function for one step of the reach-avoid recursion 
using Algorithm 1. Note that a Gaussian RBF sum with a single element is needed to upper bound Ik in each case 
(M = 1). 


5 Conclusion 


The method presented in this paper is suitable for constructing upper bounds for the reach-avoid value function 
of Markov decision processes whose stochastic kernel can be expressed as a sum of Gaussian radial basis functions 
(RBFs) with affine dependence on the state and control input. In the case where safe and target sets can be written 
as intersections of quadratic inequalities, we derive conditions that satisfy the robust constraints of an infinite 
dimensional optimization problem, equivalent to the reach-avoid dynamic programming recursion. We then provide 
a sequence of semidefinite programs to recursively construct upper bounds for the reach-avoid probability which 
are computationally efficient and scale to problems with much higher state and input dimensions than what can 
be handled in the literature. Our numerical examples indicate that the conditions used to reformulate the robust 
constraints may be conservative but the associated control policies perform close to optimal in low dimensional 
problems and considerably outperform a tuned LQG-based heuristic controller in higher dimensional problems. 

As a part of future research we will investigate methods to reduce the conservatism introduced by the suggested 
constraint reformulation and couple the dominance conditions between Gaussian RBF sum functions. Moreover, it 
is worth comparing the suggested method to similar methods based on quadratic and polynomial approximations of 
value functions of general dynamic programs. We expect our method to provide a compromise between the scalability 
of quadratic approximations and the accuracy of general polynomial ones. 
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